25 - Artificial Intelligence II [ID:47316]

Okay, let's start.

You can still hear me, right?

Okay, good.

So we've been talking about natural language as an area of AI, in line with our weak AI mandate: instead of going after general AI, which is all the things that humans can do, we go after smaller, well-defined problems.

For language, that field is called natural language processing.

We've seen a couple of those things and we've looked at language models.

And the idea there is that instead of having a true/false verdict on whether a string is in the language or not, we put a probability distribution over strings.

That approach belongs to an area called corpus linguistics.

The idea there is that rather than some person who knows English or German thinking deeply and saying, well, this is how it is in English or German, we just collect a corpus of English and start counting.

Basically, we do it scientifically, with data-driven research.

And what this gives us is essentially a variety of probability-based models that are estimated from a corpus, essentially by sophisticated counting.
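
As a rough sketch of what this "sophisticated counting" amounts to, here is a toy unigram estimate; the corpus and variable names are invented purely for illustration:

```python
from collections import Counter

# Hypothetical toy corpus; in practice this would be millions of words.
corpus = "the cat sat on the mat the cat slept".split()

counts = Counter(corpus)
total = sum(counts.values())

# Relative frequency as the probability estimate: P(w) = count(w) / N
unigram_p = {word: c / total for word, c in counts.items()}

print(unigram_p["the"])  # 3/9, about 0.33
```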

And we've basically seen that if we treat a sequence of words as a Markov process, not necessarily a first-order Markov chain but typically something of higher order, then we can just look at probabilities of small subsequences: unigrams, bigrams, trigrams.

Usually we don't go much higher than trigrams, because the counts and models get too big.
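
A minimal sketch of such a higher-order model, assuming maximum-likelihood estimates from counts (toy data, invented names): a word trigram model conditions each word on the two preceding ones.

```python
from collections import Counter

tokens = "the cat sat on the mat and the cat sat down".split()

# Count bigrams and trigrams over the token sequence.
bigram_counts = Counter(zip(tokens, tokens[1:]))
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))

def p(w3, w1, w2):
    """P(w3 | w1, w2) estimated as count(w1 w2 w3) / count(w1 w2)."""
    if bigram_counts[(w1, w2)] == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / bigram_counts[(w1, w2)]

print(p("sat", "the", "cat"))  # 2/2 = 1.0 in this toy corpus
```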

And so the things we can do with that include language identification: if we have character trigram distributions for, say, a couple of languages, we just run each of them over our language sample and see which one fits best.

Very simple, very effective, very useful.
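
One way this could look in code, as a hedged sketch: build smoothed character-trigram log-probabilities from a tiny per-language "corpus" (both corpora invented here) and score a sample against each. The add-alpha smoothing is only there so unseen trigrams don't zero out a score; real profiles would be trained on large text collections.

```python
from collections import Counter
import math

def char_trigram_logprobs(text, alpha=1.0):
    """Smoothed log-probabilities for the character trigrams seen in `text`."""
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    total = sum(grams.values())
    denom = total + alpha * (len(grams) + 1)
    logp = {g: math.log((c + alpha) / denom) for g, c in grams.items()}
    unseen = math.log(alpha / denom)  # fallback for trigrams never seen in training
    return logp, unseen

# Invented mini-corpora standing in for real training text.
models = {
    "en": char_trigram_logprobs("the quick brown fox jumps over the lazy dog"),
    "de": char_trigram_logprobs("der schnelle braune fuchs springt ueber den faulen hund"),
}

def identify(sample):
    """Score the sample under each language profile and return the best fit."""
    grams = [sample[i:i + 3] for i in range(len(sample) - 2)]
    scores = {lang: sum(logp.get(g, unseen) for g in grams)
              for lang, (logp, unseen) in models.items()}
    return max(scores, key=scores.get)

print(identify("the brown dog"))    # expected: en
print(identify("der braune hund"))  # expected: de
```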

Other things are genre classification, which is done the same way, and named entity recognition, which is an important subtask.

We've talked about that in a bit of detail.

We can do word n-grams, but of course we have a data problem there.

Even with the internet, if we want to do word trigrams, we need a lot of data.

And so we also need to talk about words that we don't have in the vocabulary: misspellings, new words, special words, dialect words that nobody else knows, those kinds of things.

We essentially condense them down to special tokens in the pre-processing step, and then we can deal with them.

Of course, if we're condensing lots of stuff into single tokens, then we're actually destroying

information.

And so there's a tension between simplifying a lot and maybe throwing away too much data

there.
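
A hedged sketch of that preprocessing step, with a made-up frequency threshold and a placeholder token name: every word below the cutoff is condensed into the same special token, which is exactly where the information loss comes from.

```python
from collections import Counter

def replace_oov(tokens, min_count=2, unk="<UNK>"):
    """Map every rare or unknown word to the same placeholder token."""
    counts = Counter(tokens)
    vocab = {w for w, c in counts.items() if c >= min_count}
    return [w if w in vocab else unk for w in tokens]

tokens = "the cat sat on the mat the cat saw the catt".split()
print(replace_oov(tokens))
# ['the', 'cat', '<UNK>', '<UNK>', 'the', '<UNK>', 'the', 'cat', '<UNK>', 'the', '<UNK>']
```

Note how the typo "catt" and several perfectly ordinary but rare words all collapse into one token: the model becomes simpler, but these words are no longer distinguishable.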

Out-of-vocabulary symbols are not something we have a lot of problems with in most languages if we use character models, at least for written language.

With my handwriting, I'm sure there are lots of out-of-vocabulary characters that you just aren't sure what they are, but typically that's not what we're looking at.

For words, though, it's already a problem.

And the last thing we looked at was a measure; in NLP it's very important to measure things: how well are we doing?
